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Abstract 

Integrating  information  from  different 
stages  of  an  NLP  processing  pipeline  can 
yield  significant  error  reduction.  We  dem¬ 
onstrate  how  re-ranking  can  improve  name 
tagging  in  a  Chinese  information  extrac¬ 
tion  system  by  incorporating  information 
from  relation  extraction,  event  extraction, 
and  coreference.  We  evaluate  three  state- 
of-the-art  re-ranking  algorithms  (MaxEnt- 
Rank,  SVMRank,  and  p-Norm  Push  Rank¬ 
ing),  and  show  the  benefit  of  multi-stage 
re-ranking  for  cross-sentence  and  cross¬ 
document  inference. 

1  Introduction 

In  recent  years,  re-ranking  techniques  have  been 
successfully  applied  to  enhance  the  performance 
of  NLP  analysis  components  based  on  generative 
models.  A  baseline  generative  model  produces  N- 
best  candidates,  which  are  then  re-ranked  using  a 
rich  set  of  local  and  global  features  in  order  to 
select  the  best  analysis.  Various  supervised  learn¬ 
ing  algorithms  have  been  adapted  to  the  task  of  re¬ 
ranking  for  NLP  systems,  such  as  MaxEnt-Rank 
(Charniak  and  Johnson,  2005;  Ji  and  Grishman, 
2005),  SVMRank  (Shen  and  Joshi,  2003),  Voted 
Perceptron  (Collins,  2002;  Collins  and  Duffy, 
2002;  Shen  and  Joshi,  2004),  Kernel  Based  Meth¬ 
ods  (Henderson  and  Titov,  2005),  and  RankBoost 
(Collins,  2002;  Collins  and  Koo,  2003;  Kudo  et  al., 
2005). 

These  algorithms  have  been  used  primarily 
within  the  context  of  a  single  NLP  analysis  com¬ 
ponent,  with  the  most  intensive  study  devoted  to 


improving  parsing  performance.  The  re-ranking 
models  for  parsing,  for  example,  normally  rely  on 
structures  generated  within  the  baseline  parser 
itself.  Achieving  really  high  performance  for  some 
analysis  components,  however,  requires  that  we 
take  a  broader  view,  one  that  looks  outside  a  sin¬ 
gle  component  in  order  to  bring  to  bear  knowl¬ 
edge  from  the  entire  NL  analysis  process.  In  this 
paper  we  will  demonstrate  the  potential  of  this 
approach  in  enhancing  the  performance  of  Chi¬ 
nese  name  tagging  within  an  information  extrac¬ 
tion  application. 

Combining  information  from  other  stages  in  the 
analysis  pipeline  allows  us  to  incorporate  informa¬ 
tion  from  a  much  wider  context,  spanning  the  en¬ 
tire  document  and  even  going  across  documents. 
This  will  give  rise  to  new  design  issues;  we  will 
examine  and  compare  different  re-ranking  algo¬ 
rithms  when  applied  to  this  task. 

We  shall  first  describe  the  general  setting  and 
the  special  characteristics  of  re-ranking  for  name 
tagging.  Then  we  present  and  evaluate  three  re¬ 
ranking  algorithms  -  MaxEnt-Rank,  SVMRank 
and  a  new  algorithm,  p-Norm  Push  Ranking  -  for 
this  problem,  and  show  how  an  approach  based  on 
multi-stage  re-ranking  can  effectively  handle  fea¬ 
tures  across  sentence  and  document  boundaries. 

2  Prior  Work 

2.1  Ranking 

We  will  describe  the  three  state-of-the-art  super¬ 
vised  ranking  techniques  considered  in  this  work. 
Later  we  shall  apply  and  evaluate  these  algorithms 
for  re-ranking  in  the  context  of  name  tagging. 

Maximum  Entropy  modeling  (MaxEnt)  has 
been  extremely  successful  for  many  NLP  classifi- 
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cation  tasks,  so  it  is  natural  to  apply  it  to  re¬ 
ranking  problems.  (Chamiak  and  Johnson,  2005) 
applied  MaxEnt  to  improve  the  performance  of  a 
state-of-art  parser;  also  in  (Ji  and  Grishman,  2005) 
we  used  it  to  improve  a  Chinese  name  tagger. 

Using  SVMRank,  (Shen  and  Joshi,  2003) 
achieved  significant  improvement  on  parse  re¬ 
ranking.  They  compared  two  different  sample 
creation  methods,  and  presented  an  efficient  train¬ 
ing  method  by  separating  the  training  samples  into 
subsets. 

The  last  approach  we  consider  is  a  boosting- 
style  approach.  We  implement  a  new  algorithm 
called  p-Norm  Push  Ranking  (Rudin,  2006).  This 
algorithm  is  a  generalization  of  RankBoost 
(Freund  et  al.  1998)  which  concentrates  specifi¬ 
cally  on  the  top  portion  of  a  ranked  list.  The  pa¬ 
rameter  “p”  determines  how  much  the  algorithm 
concentrates  at  the  top. 

2.2  Enhancing  Named  Entity  Taggers 

There  have  been  a  very  large  number  of  NE  tagger 
implementations  since  this  task  was  introduced  at 
MUC-6  (Grishman  and  Sundheim,  1996).  Most 
implementations  use  local  features  and  a  unifying 
learning  algorithm  based  on,  e.g.,  an  HMM,  Max¬ 
Ent,  or  SVM.  Collins  (2002)  augmented  a  baseline 
NE  tagger  with  a  re-ranker  that  used  only  local, 
NE-oriented  features.  Roth  and  Yih  (2002)  com¬ 
bined  NE  and  semantic  relation  tagging,  but 
within  a  quite  different  framework  (using  a  linear 
programming  model  for  joint  inference). 

3  A  Framework  for  Name  Re-Ranking 

3.1  The  Information  Extraction  Pipeline 

The  extraction  task  we  are  addressing  is  that  of  the 
Automatic  Content  Extraction  (ACE)1  evaluations. 
The  2005  ACE  evaluation  had  7  types  of  entities, 
of  which  the  most  common  were  PER  (persons), 
ORG  (organizations),  LOC  (natural  locations)  and 
GPE  (‘geo-political  entities’  -  locations  which  are 
also  political  units,  such  as  countries,  counties, 
and  cities).  There  were  6  types  of  semantic  rela¬ 
tions,  with  18  subtypes.  Examples  of  these  rela¬ 
tions  are  “the  CEO  of  Microsoft”  (an 
organization-affiliation  relation),  “Fred’s  wife”  (a 

1  The  ACE  task  description  can  be  found  at 
http://www.itl.nist.gov/iad/894.01/tests/ace/ 


personal-social  relation),  and  “a  military  base  in 
Germany”  (a  located  relation).  And  there  were  8 
types  of  events,  with  33  subtypes,  such  as  “Kurt 
Schork  died  in  Sierra  Leone  yesterday”  (a  Die 
event),  and  “Schweitzer  founded  a  hospital  in 
1913”  (a  Start-Org  event). 

To  extract  these  elements  we  have  developed  a 
Chinese  information  extraction  pipeline  that  con¬ 
sists  of  the  following  stages: 

•  Name  tagging  and  name  structure  parsing 
(which  identifies  the  internal  structure  of  some 
names); 

•  Coreference  resolution,  which  links  "men¬ 
tions"  (referring  phrases  of  selected  semantic 
types)  into  "entities":  this  stage  is  a  combina¬ 
tion  of  high-precision  heuristic  rules  and 
maximum  entropy  models; 

•  Relation  tagging,  using  a  K-nearest-neighbor 
algorithm  to  identify  relation  types  and  sub- 
types; 

•  Event  patterns,  semi-automatically  extracted 
from  ACE  training  corpora. 

3.2  Hypothesis  Representation  and  Genera¬ 
tion 

Again,  the  central  idea  is  to  apply  the  baseline 
name  tagger  to  generate  N-Best  multiple  hypothe¬ 
ses  for  each  sentence;  the  results  from  subsequent 
components  are  then  exploited  to  re-rank  these 
hypotheses  and  the  new  top  hypothesis  is  output 
as  the  final  result. 

In  our  name  re-ranking  model,  each  hypothesis 
is  an  NE  tagging  of  the  entire  sentence.  For  ex¬ 
ample,  “<PER> John</PER>  was  born  in 
<GPE>New  York</GPE>.”  is  one  hypothesis 
for  the  sentence  “John  was  born  in  New  York”. 

We  apply  a  HMM  tagger  to  identify  four  named 
entity  types:  Person,  GPE,  Organization  and  Loca¬ 
tion.  The  HMM  tagger  generally  follows  the 
Nymble  model  (Bikel  et  al,  1997),  and  uses  best- 
first  search  to  generate  N-Best  hypotheses.  It  also 
computes  the  “margin”,  which  is  the  difference 
between  the  log  probabilities  of  the  top  two  hy¬ 
potheses.  This  is  used  as  a  rough  measure  of  con¬ 
fidence  in  the  top  hypothesis.  A  large  margin 
indicates  greater  confidence  that  the  first  hypothe¬ 
sis  is  correct.  The  margin  also  determines  the 
number  of  hypotheses  ( N)  that  we  will  store.  Us¬ 
ing  cross-validation  on  the  training  data,  we  de¬ 
termine  the  value  of  N  required  to  include  the  best 
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hypothesis,  as  a  function  of  the  margin.  We  then 
divide  the  margin  into  ranges  of  values,  and  set  a 
value  of  N  for  each  range,  with  a  maximum  of  30. 

To  obtain  the  training  data  for  the  re-ranking 
algorithm,  we  separate  the  name  tagging  training 
corpus  into  k  folders,  and  train  the  HMM  name 
tagger  on  k-1  folders.  We  then  use  the  HMM  to 

generate  N-Best  hypotheses  H  =  {hIt  h2 . hNj  for 

each  sentence  in  the  remaining  folder.  Each  ht  in 
H  is  then  paired  with  its  NE  F-measure,  measured 
against  the  key  in  the  annotated  corpus. 

We  define  a  “crucial  pair”  as  a  pair  of  hypothe¬ 
ses  such  that,  according  to  F-Measure,  the  first 
hypothesis  in  the  pair  should  be  more  highly 
ranked  than  the  second.  That  is,  if  for  a  sentence, 
the  F-Measure  of  hypothesis  /?,  is  larger  than  that 
of  hj,  then  (hu  hj)  is  a  crucial  pair. 

3.3  Re-Ranking  Functions 

We  investigated  the  following  three  different  for¬ 
mulations  of  the  re-ranking  problem: 

•  Direct  Re-Ranking  by  Score 

For  each  hypothesis  hh  we  attempt  to  learn  a  scor¬ 
ing  function / :  H  R,  such  that  f(hi)  >  f(hj)  if  the 
F-Measure  of  ht  is  higher  than  the  F-measure  of  hj. 

•  Direct  Re-Ranking  by  Classification 

For  each  hypothesis  hb  we  attempt  to  leam  / :  H 
->  {-1,  1},  such  that  f(hj)  =  1  if  ht  has  the  top  F- 
Measure  among  H;  otherwise  f(hj)  =  -1.  This  can 
be  considered  a  special  case  of  re-ranking  by 
score. 

•  Indirect  Re-Ranking  Function 

For  each  “crucial”  pair  of  hypotheses  (hu  hj),  we 
learn  f :  H  X  //  ->  {-1,  1},  such  that  f(h„  hj)  =  1  if 
hj  is  better  than  hj;  f  (hb  hj)  =  - 1  if  ht  is  worse  than 
hj.  We  call  this  “indirect”  ranking  because  we 
need  to  apply  an  additional  decoding  step  to  pick 
the  best  hypothesis  from  these  pair-wise  compari¬ 
son  results. 

4  Features  for  Re-Ranking 

4.1  Inferences  From  Subsequent  Stages 

Information  extraction  is  a  potentially  symbiotic 
pipeline  with  strong  dependencies  between  stages 
(Roth  and  Yih,  2002&2004;  Ji  and  Grishman, 
2005).  Thus,  we  use  features  based  on  the  output 


of  four  subsequent  stages  -  name  structure  parsing, 
relation  extraction,  event  patterns,  and  coreference 
analysis  -  to  seek  the  best  hypothesis. 

We  included  ten  features  based  on  name  struc¬ 
ture  parsing  to  capture  the  local  information 
missed  by  the  baseline  name  tagger  such  as  details 
of  the  structure  of  Chinese  person  names. 

The  relation  and  event  re-ranking  features  are 
based  on  matching  patterns  of  words  or  constitu¬ 
ents.  They  serve  to  correct  name  boundary  errors 
(because  such  errors  would  prevent  some  patterns 
from  matching).  They  also  exert  selectional  pref¬ 
erences  on  their  arguments,  and  so  serve  to  correct 
name  type  errors.  For  each  relation  argument,  we 
included  a  feature  whose  value  is  the  likelihood 
that  relation  appears  with  an  argument  of  that  se¬ 
mantic  type  (these  probabilities  are  obtained  from 
the  training  corpus  and  binned).  For  each  event 
pattern,  a  feature  records  whether  the  types  of  the 
arguments  match  those  required  by  the  pattern. 

Coreference  can  link  multiple  mentions  of 
names  provided  they  have  the  same  spelling 
(though  if  a  name  has  several  parts,  some  may  be 
dropped)  and  same  semantic  type.  So  if  the 
boundary  or  type  of  one  mention  can  be  deter¬ 
mined  with  some  confidence,  coreference  can  be 
used  to  disambiguate  other  mentions,  by  favoring 
hypotheses  which  support  more  coreference.  To 
this  end,  we  incorporate  several  features  based  on 
coreference,  such  as  the  number  of  mentions  re¬ 
ferring  to  a  name  candidate. 

Each  of  these  features  is  defined  for  individual 
name  candidates;  the  value  of  the  feature  for  a 
hypothesis  is  the  sum  of  its  values  over  all  names 
in  the  hypothesis.  The  complete  set  of  detailed 
features  is  listed  in  (Ji  and  Grishman,  2006). 

4.2  Handling  Cross-Sentence  Features  by 
Multi-Stage  Re-Ranking 

Coreference  is  potentially  a  powerful  contributor 
for  enhancing  NE  recognition,  because  it  provides 
information  from  other  sentences  and  even  docu¬ 
ments,  and  it  applies  to  all  sentences  that  include 
names.  For  a  name  candidate,  62%  of  its  corefer¬ 
ence  relations  span  sentence  boundaries.  How¬ 
ever,  this  breadth  poses  a  problem  because  it 
means  that  the  score  of  a  hypothesis  for  a  given 
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sentence  may  depend  on  the  tags  assigned  to  the 
same  names  in  other  sentences.2 

Ideally,  when  we  re-rank  the  hypotheses  for  one 
sentence  S,  the  other  sentences  that  include  men¬ 
tions  of  the  same  name  should  already  have  been 
re-ranked,  but  this  is  not  possible  because  of  the 
mutual  dependence.  Repeated  re-ranking  of  a  sen¬ 
tence  would  be  time-consuming,  so  we  have 
adopted  an  alternative  approach.  Instead  of  incor¬ 
porating  coreference  evidence  with  all  other  in¬ 
formation  in  one  re-ranker,  we  apply  two  re¬ 
rankers  in  succession. 

In  the  first  re-ranking  step,  we  generate  new 
rankings  for  all  sentences  based  on  name  structure, 
relation  and  event  features,  which  are  all  sentence- 
internal  evidence.  Then  in  a  second  pass,  we  ap¬ 
ply  a  re-ranker  based  on  coreference  between  the 
names  in  each  hypothesis  of  sentence  S  and  the 
mentions  in  the  top-ranking  hypothesis  (from  the 
first  re-ranker)  of  all  other  sentences.3  In  this  way, 
the  coreference  re-ranker  can  propagate  globally 
(across  sentences  and  documents)  high-confidence 
decisions  based  on  the  other  evidence.  In  our  final 
MaxEnt  Ranker  we  obtained  a  small  additional 
gain  by  further  splitting  the  first  re-ranker  into 
three  separate  steps:  a  name  structure  based  re¬ 
ranker,  a  relation  based  re-ranker  and  an  event 
based  re-ranker;  these  were  incorporated  in  an 
incremental  structure. 

4.3  Adding  Cross-Document  Information 

The  idea  in  coreference  is  to  link  a  name  mention 
whose  tag  is  locally  ambiguous  to  another  men¬ 
tion  that  is  unambiguously  tagged  based  on  local 
evidence.  The  wider  a  net  we  can  cast,  the  greater 
the  chance  of  success.  To  cast  the  widest  net  pos¬ 
sible,  we  have  used  cross-document  coreference 
for  the  test  set.  We  cluster  the  documents  using  a 
cross-entropy  metric  and  then  treat  the  entire  clus¬ 
ter  as  a  single  document. 

We  take  all  the  name  candidates  in  the  top  N 
hypotheses  for  each  sentence  in  each  cluster  T  to 
construct  a  “query  set”  Q.  The  metric  used  for  the 
clustering  is  the  cross  entropy  H(T,  d)  between  the 
distribution  of  the  name  candidates  in  T  and 


2  For  in-document  coreference,  this  problem  could  be  avoided  if  the  tagging  of 
an  entire  document  constituted  a  hypothesis,  but  that  would  be  impractical ...  a 
very  large  N  would  be  required  to  capture  sufficient  alternative  taggings  in  an 
N-best  framework. 

3  This  second  pass  is  skipped  for  sentences  for  which  the  confidence  in  the  top 
hypothesis  produced  by  the  first  re-ranker  is  above  a  threshold. 


document  d.  If  H(T,  d)  is  smaller  than  a  threshold 
then  we  add  d  to  T.  H(T,  d)  is  defined  by: 

H(T,d)  =  prob(T,x )  x  log  prob(d,x) . 

xgO 

We  built  these  clusters  two  ways:  first,  just 
clustering  the  test  documents;  second,  by  aug¬ 
menting  these  clusters  with  related  documents 
retrieved  from  a  large  unlabeled  corpus  (with 
document  relevance  measured  using  cross¬ 
entropy). 

5  Re-Ranking  Algorithms 

We  have  been  focusing  on  selecting  appropriate 
ranking  algorithms  to  fit  our  application.  We 
choose  three  state-of-the-art  ranking  algorithms 
that  have  good  generalization  ability.  We  now 
describe  these  algorithms. 

5.1  MaxEnt-Rank 

5.1.1  Sampling  and  Pruning 

Maximum  Entropy  models  are  useful  for  the  task 
of  ranking  because  they  compute  a  reliable  rank¬ 
ing  probability  for  each  hypothesis.  We  have  tried 
two  different  sampling  methods  -  single  sampling 
and  pairwise  sampling. 

The  first  approach  is  to  use  each  single  hy¬ 
pothesis  hi  as  a  sample.  Only  the  best  hypothesis 
of  each  sentence  is  regarded  as  a  positive  sample; 
all  the  rest  are  regarded  as  negative  samples.  In 
general,  absolute  values  of  features  are  not  good 
indicators  of  whether  a  hypothesis  will  be  the  best 
hypothesis  for  a  sentence;  for  example,  a  co¬ 
referring  mention  count  of  7  may  be  excellent  for 
one  sentence  and  poor  for  another.  Consequently, 
in  this  single-hypothesis-sampling  approach,  we 
convert  each  feature  to  a  Boolean  value,  which  is 
true  if  the  original  feature  takes  on  its  maximum 
value  (among  all  hypotheses)  for  this  hypothesis. 
This  does,  however,  lose  some  of  the  detail  about 
the  differences  between  hypotheses. 

In  pairwise  sampling  we  used  each  pair  of  hy¬ 
potheses  (hj,  hj)  as  a  sample.  The  value  of  a  fea¬ 
ture  for  a  sample  is  the  difference  between  its 
values  for  the  two  hypotheses.  However,  consid¬ 
ering  all  pairs  causes  the  number  of  samples  to 
grow  quadratically  (0(N2))  with  the  number  of 
hypotheses,  compared  to  the  linear  growth  with 
best/non-best  sampling.  To  make  the  training  and 
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test  procedures  more  efficient,  we  prune  the  data 
in  several  ways. 

We  perform  pruning  by  beam  setting,  removing 
candidate  hypotheses  that  possess  very  low  prob¬ 
abilities  from  the  HMM,  and  during  training  we 
discard  the  hypotheses  with  very  low  F-measure 
scores.  Additionally,  we  incorporate  the  pruning 
techniques  used  in  (Chiang  2005),  by  which  any 
hypothesis  with  a  probability  lower  than  a  times 
the  highest  probability  for  one  sentence  is  dis¬ 
carded.  We  also  discard  the  pairs  very  close  in 
performance  or  probability. 

5.1.2  Decoding 

If  /  is  the  ranking  function,  the  MaxEnt  model 
produces  a  probability  for  each  un-pruned  “cru¬ 
cial”  pair:  1 orob(f(hu  hj)  =  7),  i.e.,  the  probability 
that  for  the  given  sentence,  hi  is  a  better  hypothe¬ 
sis  than  hj.  We  need  an  additional  decoding  step  to 
select  the  best  hypothesis.  Inspired  by  the  caching 
idea  and  the  multi-class  solution  proposed  by 
(Platt  et  al.  2000),  we  use  a  dynamic  decoding 
algorithm  with  complexity  O(n)  as  follows. 

We  scale  the  probability  values  into  three  types: 
CompareResult  (hb  hj)  =  “better”  if  prob(f(hit  h,)  = 
1)  >  5  b  “worse”  if  prob(f(hu  hj)  =  1)  <5  2,  and 
“unsure”  otherwise,  where  8  [  ^  8  2. 4 

Prune 

for  i  =  1  to  n 
Num  =  0; 

for  j  =  1  to  n  and  jVi 

If  CompareResult(h„  hj)  =  “worse” 
Num++; 

if  Num>  (7  then  discard  hj  from  H 

Select 

Initialize:  i  =  1,  j  =  n 
while  (i<j) 

if  CompareResult(hi,  hj)  =  “better” 
discard  hj  from  H; 

j--; 

else  if  CompareResult(ht,  hj)  =  “worse” 
discard  hj  from  H; 

i++; 

else  break; 


^  In  the  final  stage  re-ranker  we  use  8  ,=  8  2  so  that  we  don’t  generate  the 
output  of  “unsure”,  and  one  hypothesis  is  finally  selected. 


Output 

If  the  number  of  remaining  hypotheses  in  H  is  1, 
then  output  it  as  the  best  hypothesis;  else  propa¬ 
gate  all  hypothesis  pairs  into  the  next  re-ranker. 

5.2  SVMRank 

We  implemented  an  SVM-based  model,  which 
can  theoretically  achieve  very  low  generalization 
error.  We  use  the  SVMLight  package  (Joachims, 
1998),  with  the  pairwise  sampling  scheme  as  for 
MaxEnt-Rank.  In  addition  we  made  the  following 
adaptations:  we  calibrated  the  SVM  outputs,  and 
separated  the  data  into  subsets. 

To  speed  up  training,  we  divided  our  training 
samples  into  k  subsets.  Each  subset  contains  N(N- 
l)/k  pairs  of  hypotheses  of  each  sentence. 

In  order  to  combine  the  results  from  these  dif¬ 
ferent  SVMs,  we  must  calibrate  the  function  val¬ 
ues;  the  output  of  an  SVM  yields  a  distance  to  the 
separating  hyperplane,  but  not  a  probability.  We 
have  applied  the  method  described  in  (Shen  and 
Joshi,  2003),  to  map  SVM’s  results  to  probabili¬ 
ties  via  a  sigmoid.  Thus  from  the  k'h  SVM,  we  get 
the  probability  for  each  pair  of  hypotheses: 
prob(fk(hi,hj)  - 1) , 

namely  the  probability  of  h;  being  better  than  hj. 
Then  combining  all  k  SVMs’  results  we  get: 

z  (A » hj )  =  II  Prob(fk  (hi  ’  hj )  =  1)  - 

k 

So  the  hypothesis  A,  with  maximal  value  is  cho¬ 
sen  as  the  top  hypothesis: 

argma x(QZ(A,.,A,)). 

h>  j 

5.3  P-Norm  Push  Ranking 

The  third  algorithm  we  have  tried  is  a  general 
boosting-style  supervised  ranking  algorithm  called 
p-Norm  Push  Ranking  (Rudin,  2006).  We  de¬ 
scribe  this  algorithm  in  more  detail  since  it  is  quite 
new  and  we  do  not  expect  many  readers  to  be  fa¬ 
miliar  with  it. 

The  parameter  “p”  determines  how  much  em¬ 
phasis  (or  “push”)  is  placed  closer  to  the  top  of  the 
ranked  list,  where  p>l.  The  p-Norm  Push  Ranking 
algorithm  generalizes  RankBoost  (take  p=l  for 
RankBoost).  When  p  is  set  at  a  large  value,  the 
rankings  at  the  top  of  the  list  are  given  higher  pri¬ 
ority  (a  large  “push”),  at  the  expense  of  possibly 
making  misranks  towards  the  bottom  of  the  list. 
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Since  for  our  application,  we  do  not  care  about  the 
rankings  at  the  bottom  of  the  list  (i.e.,  we  do  not 
care  about  the  exact  rank  ordering  of  the  bad  hy¬ 
potheses),  this  algorithm  is  suitable  for  our  prob¬ 
lem.  There  is  a  tradeoff  for  the  choice  of  p;  larger 
p  yields  more  accurate  results  at  the  very  top  of 
the  list  for  the  training  data.  If  we  want  to  consider 
more  than  simply  the  very  top  of  the  list,  we  may 
desire  a  smaller  value  of  p.  Note  that  larger  values 
of  p  also  require  more  training  data  in  order  to 
maintain  generalization  ability  (as  shown  both  by 
theoretical  generalization  bounds  and  experi¬ 
ments).  If  we  want  large  p,  we  must  aim  to  choose 
the  largest  value  of  p  that  allows  generalization, 
given  our  amount  of  training  data.  When  we  are 
working  on  the  first  stage  of  re-ranking,  we  con¬ 
sider  the  whole  top  portion  of  the  ranked  list,  be¬ 
cause  we  use  the  rank  in  the  list  as  a  feature  for 
the  next  stage.  Thus,  we  have  chosen  the  value 
Pi=4  (a  small  “push”)  for  the  first  re-ranker.  For 
the  second  re-ranker  we  choose  p2=16  (a  large 
“push”). 

The  objective  of  the  p-Norm  Push  Ranking  al¬ 
gorithm  is  to  create  a  scoring  function  f:  H~>R 
such  that  for  each  crucial  pair  (hb  hj),  we  shall 
have  f(hj)  >  f(hj).  The  form  of  the  scoring  function 
is  f(hi)  =  Y.akgk(hi),  where  gk  is  called  a  weak 
ranker:  gk  :  II  ->  [0,1].  The  values  of  a*  are  de¬ 
termined  by  the  p-Norm  Push  algorithm  in  an  it¬ 
erative  way. 

The  weak  rankers  gk  are  the  features  described 
in  Section  4.  Note  that  we  sometimes  allow  the 
algorithm  to  use  both  gk  and  g  ’i/hi)= I  -gi/hj  as 
weak  rankers,  namely  when  gk  has  low  accuracy 
on  the  training  set;  this  way  the  algorithm  itself 
can  decide  which  to  use. 

As  in  the  style  of  boosting  algorithms,  real¬ 
valued  weights  are  placed  on  each  of  the  training 
crucial  pairs,  and  these  weights  are  successively 
updated  by  the  algorithm.  Fligher  weights  are 
given  to  those  crucial  pairs  that  were  misranked  at 
the  previous  iteration,  especially  taking  into  ac¬ 
count  the  pairs  near  the  top  of  the  list.  At  each 
iteration,  one  weak  ranker  gk  is  chosen  by  the  al¬ 
gorithm,  based  on  the  weights.  The  coefficient  ak 
is  then  updated  accordingly. 

6  Experiment  Results 

6.1  Data  and  Resources 


We  use  100  texts  from  the  ACE  04  training  corpus 
for  a  blind  test.  The  test  set  included  2813  names: 
1126  persons,  712  GPEs,  785  organizations  and 
190  locations.  The  performance  is  measured  via 
Precision  (P),  Recall  (R)  and  F-Measure  (F). 

The  baseline  name  tagger  is  trained  from  2978 
texts  from  the  People’s  Daily  news  in  1998  and 
also  1300  texts  from  ACE  training  data. 

The  1,071,285  training  samples  (pairs  of  hy¬ 
potheses)  for  the  re-rankers  are  obtained  from  the 
name  tagger  applied  on  the  ACE  training  data,  in 
the  manner  described  in  Section  3.2. 

We  use  OpenNLP5  for  the  MaxEnt-Rank  ex¬ 
periments.  We  use  SVMllght  (Joachims,  1998)  for 
SVMRank,  with  a  linear  kernel  and  the  soft  mar¬ 
gin  parameter  set  to  the  default  value.  For  the  p- 
Norm  Push  Ranking,  we  apply  33  weak  rankers, 
i.e.,  features  described  in  Section  4.  The  number 
of  iterations  was  fixed  at  110,  this  number  was 
chosen  by  optimizing  the  performance  on  a  devel¬ 
opment  set  of  100  documents. 

6.2  Effect  of  Pairwise  Sampling 

We  have  tried  both  single-hypothesis  and  pairwise 
sampling  (described  in  section  5.1.1)  in  MaxEnt- 
Rank  and  p-Norm  Push  Ranking.  Table  1  shows 
that  pairwise  sampling  helps  both  algorithms. 
MaxEnt-Rank  benefited  more  from  it,  with  preci¬ 
sion  and  recall  increased  2.2%  and  0.4%  respec¬ 
tively. 


Model 

P 

R 

F 

MaxEnt- 

Rank 

Single  Sampling 

89.6 

90.2 

89.9 

Pairwise  Sampling 

91.8 

90.6 

91.2 

p-Norm 

Push 

Single  Sampling 

91.4 

89.6 

90.5 

Pairwise  Sampling 

91.2 

90.8 

91.0 

Table  1.  Effect  of  Pairwise  Sampling 


6.3  Overall  Performance 

In  Table  2  we  report  the  overall  performance  for 
these  three  algorithms.  All  of  them  achieved  im¬ 
provements  on  the  baseline  name  tagger.  MaxEnt 
yields  the  highest  precision,  while  p-Norm  Push 
Ranking  with  p2  =  16  yields  the  highest  recall. 

A  larger  value  of  “p”  encourages  the  p-Norm 
Push  Ranking  algorithm  to  perform  better  near  the 
top  of  the  ranked  list.  As  we  discussed  in  section 


5  http ://maxent. sourceforge.net/index.html 
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5.3,  we  use  pi  =  4  (a  small  “push”)  for  the  first  re¬ 
ranker  and  p2  =  16  (a  big  “push”)  for  the  second 
re-ranker.  From  Table  2  we  can  see  that  p2  =  16 
obviously  performed  better  than  p2  =  1 .  In  general, 
we  have  observed  that  for  p2  ^  16,  larger  p2  corre¬ 
lates  with  better  results. 


Model 

P 

R 

F 

Baseline 

87.4 

87.6 

87.5 

MaxEnt-Rank 

91.8 

90.6 

91.2 

SVMRank 

89.5 

90.1 

89.8 

p-Norm  Push  Ranking  (p2  =16) 

91.2 

90.8 

91.0 

p-Norm  Push  Ranking 
(p2=l,  RankBoost) 

89.3 

89.7 

89.5 

Table  2.  Overall  Performance 

The  improved  NE  results  brought  better  per¬ 
formance  for  the  subsequent  stages  of  information 
extraction  too.  We  use  the  NE  outputs  from  Max- 
Ent-Ranker  as  inputs  for  coreference  resolver  and 
relation  tagger.  The  ACE  value6  of  entity  detec¬ 
tion  (mention  detection  +  coreference  resolution) 
is  increased  from  73.2  to  76.5;  the  ACE  value  of 
relation  detection  is  increased  from  34.2  to  34.8. 

6.4  Effect  of  Cross-document  Information 

As  described  in  Section  4.3,  our  algorithm  incor¬ 
porates  cross-document  coreference  information. 
The  100  texts  in  the  test  set  were  first  clustered 
into  28  topics  (clusters).  We  then  apply  cross¬ 
document  coreference  on  each  cluster.  Compared 
to  single  document  coreference,  cross-document 
coreference  obtained  0.5%  higher  F-Measure,  us¬ 
ing  MaxEnt-Ranker,  improving  performance  for 
15  of  these  28  clusters. 

These  clusters  were  then  extended  by  selecting 
84  additional  related  texts  from  a  corpus  of  15,000 
unlabeled  Chinese  news  articles  (using  a  cross¬ 
entropy  metric  to  select  texts).  24  clusters  gave 
further  improvement,  and  an  overall  0.2%  further 
improvement  on  F-Measure  was  obtained. 


6.5  Efficiency 


Model 

Training 

Test 

MaxEnt-Rank 

7  hours 

55  minutes 

SVMRank 

48  hours 

2  hours 

p-Norm  Push  Ranking 

3.2  hours 

10  minutes 

Table  3.  Efficiency  Comparison 


^  The  ACE04  value  scoring  metric  can  be  found  at: 
http://www.nist.gov/speech/tests/ace/ace04/doc/ace04-evalplan-v7.pdf 


In  Table  3  we  summarize  the  running  time  of 
these  three  algorithms  in  our  application. 

7  Discussion 

We  have  shown  that  the  other  components  of  an 
IE  pipeline  can  provide  information  which  can 
substantially  improve  the  performance  of  an  NE 
tagger,  and  that  these  improvements  can  be  real¬ 
ized  through  a  variety  of  re-ranking  algorithms. 
MaxEnt  re-ranking  using  binary  sampling  and  p- 
Norm  Push  Ranking  proved  about  equally  effec¬ 
tive.7  p-Norm  Push  Ranking  was  particularly  ef¬ 
ficient  for  decoding  (about  10  documents  / 
minute),  although  no  great  effort  was  invested  in 
tuning  these  procedures  for  speed. 

We  presented  methods  to  handle  cross-sentence 
inference  using  staged  re-ranking  and  to  incorpo¬ 
rate  additional  evidence  through  document  clus¬ 
tering. 

An  N-best  /  re-ranking  strategy  has  proven  ef¬ 
fective  for  this  task  because  with  relatively  small 
values  of  N  we  are  already  able  to  include  highly- 
rated  hypotheses  for  most  sentences.  Using  the 
values  of  N  we  have  used  throughout  (dependent 
on  the  margin  of  the  baseline  HMM,  but  never 
above  30),  the  upper  bound  of  N-best  performance 
(if  we  always  picked  the  top-scoring  hypothesis) 
is  97.4%  recall,  96.2%  precision,  F=96.8%. 

Collins  (2002)  also  applied  re-ranking  to  im¬ 
prove  name  tagging.  Our  work  has  addressed  both 
name  identification  and  classification,  while  his 
only  evaluated  name  identification.  Our  re-ranker 
used  features  from  other  pipeline  stages,  while  his 
were  limited  to  local  features  involving  lexical 
information  and  'word-shape'  in  a  5 -token  window. 
Since  these  feature  sets  are  essentially  disjoint,  it 
is  quite  possible  that  a  combination  of  the  two 
could  yield  even  further  improvements.  His  boost¬ 
ing  algorithm  is  a  modification  of  the  method  in 
(Freund  et  al.,  1998),  an  adaptation  of  AdaBoost, 
whereas  our  p-Norm  Push  Ranking  algorithm  can 
emphasize  the  hypotheses  near  the  top,  matching 
our  objective. 

Roth  and  Yih  (2004)  combined  information 
from  named  entities  and  semantic  relation  tagging, 
adopting  a  similar  overall  goal  but  using  a  quite 
different  approach  based  on  linear  programming. 


The  features  were  initially  developed  and  tested  using  the  MaxEnt  re-ranker, 
so  it  is  encouraging  that  they  worked  equally  well  with  the  p-Norm  Push 
Ranker  without  further  tuning. 
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They  limited  themselves  to  name  classification, 
assuming  the  identification  given.  This  may  be  a 
natural  subtask  for  English,  where  capitalization  is 
a  strong  indicator  of  a  name,  but  is  much  less  use¬ 
ful  for  Chinese,  where  there  is  no  capitalization  or 
word  segmentation,  and  boundary  errors  on  name 
identification  are  frequent.  Expanding  their  ap¬ 
proach  to  cover  identification  would  have  greatly 
increased  the  number  of  hypotheses  and  made 
their  approach  slower.  In  contrast,  we  adjust  the 
number  of  hypotheses  based  on  the  margin  in  or¬ 
der  to  maintain  efficiency  while  minimizing  the 
chance  of  losing  a  high-quality  hypothesis. 

In  addition  we  were  able  to  capture  selectional 
preferences  (probabilities  of  semantic  types  as 
arguments  of  particular  semantic  relations  as 
computed  from  the  corpus),  whereas  Roth  and  Yih 
limited  themselves  to  hard  (boolean)  type  con¬ 
straints. 
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